Reduce size of utilities bundled jar #538

Open
wants to merge 5 commits into
base: main

@vamsikarnika vamsikarnika commented Sep 17, 2024

What is the purpose of the pull request

  • This pull request removes unused dependencies from the xtable-utilities pom to reduce the size of the xtable-utilities bundled jar.

Brief change log

  • Removed unused dependencies from xtable-utilities jar

Verify this pull request

This pull request is a trivial rework / code cleanup without any test coverage.

@vamsikarnika vamsikarnika marked this pull request as draft September 17, 2024 12:49
@vamsikarnika vamsikarnika marked this pull request as ready for review September 17, 2024 13:02
@vamsikarnika
Author

With these changes, the bundled xtable-utilities jar comes to around 160 MB. It was around 600 MB before.

@vamsikarnika vamsikarnika changed the title from "Exclude unused dependencies from spark jars" to "Reduce size of utilities bundled jar" Sep 17, 2024
pom.xml
Comment on lines +55 to +56
</exclusion>
</exclusions>
Author

Excluding servlet and antlr-runtime dependencies, since they're not being used in utilities.

Contributor

Which dependencies are the servlet and antlr jars coming from? These are actually not Apache 2.0 compliant. Can you check whether the ones marked in yellow for xtable-utilities can be removed?

https://docs.google.com/spreadsheets/d/1XBCZBeWqF2D5d1L4H9fRQ1QLarJq4tLgIEgnGiNdoV8/edit?usp=sharing

Author

The servlet jars come from multiple dependencies, such as hudi-common and hbase-server, as shown below. Antlr comes from delta-core.

[INFO] +- org.apache.xtable:xtable-core:jar:0.2.0-SNAPSHOT:compile
[INFO] | +- org.apache.hudi:hudi-java-client:jar:0.14.0:compile
[INFO] | | \- org.apache.hudi:hudi-client-common:jar:0.14.0:compile
[INFO] | | +- org.apache.hudi:hudi-timeline-service:jar:0.14.0:compile
[INFO] | | | +- io.javalin:javalin:jar:4.6.7:compile
[INFO] | | | | +- org.eclipse.jetty:jetty-server:jar:9.4.54.v20240208:compile
[INFO] | | | | | +- org.eclipse.jetty:jetty-http:jar:9.4.54.v20240208:compile
[INFO] | | | | | \- org.eclipse.jetty:jetty-io:jar:9.4.54.v20240208:compile
[INFO] | | | | +- org.eclipse.jetty:jetty-webapp:jar:9.4.54.v20240208:compile
[INFO] | | | | | +- org.eclipse.jetty:jetty-xml:jar:9.4.54.v20240208:compile
[INFO] | | | | | \- org.eclipse.jetty:jetty-servlet:jar:9.4.54.v20240208:compile
[INFO] | | | | | \- org.eclipse.jetty:jetty-security:jar:9.4.54.v20240208:compile
[INFO] | | | | +- org.eclipse.jetty.websocket:websocket-server:jar:9.4.54.v20240208:compile
[INFO] | | | | | +- org.eclipse.jetty.websocket:websocket-common:jar:9.4.54.v20240208:compile
[INFO] | | | | | | \- org.eclipse.jetty.websocket:websocket-api:jar:9.4.54.v20240208:compile
[INFO] | | | | | +- org.eclipse.jetty.websocket:websocket-client:jar:9.4.54.v20240208:compile
[INFO] | | | | | | \- org.eclipse.jetty:jetty-client:jar:9.4.54.v20240208:compile
[INFO] | | | | | \- org.eclipse.jetty.websocket:websocket-servlet:jar:9.4.54.v20240208:compile

[INFO] +- org.apache.xtable:xtable-core:jar:0.2.0-SNAPSHOT:compile
[INFO] | +- com.fasterxml.jackson.module:jackson-module-scala_2.12:jar:2.17.1:compile
[INFO] | +- org.apache.hudi:hudi-common:jar:0.14.0:compile
[INFO] | | +- org.apache.hbase:hbase-server:jar:2.4.9:compile
[INFO] | | | +- javax.servlet.jsp:javax.servlet.jsp-api:jar:2.3.1:compile

[INFO] +- org.apache.xtable:xtable-core:jar:0.2.0-SNAPSHOT:compile
[INFO] | +- org.apache.xtable:xtable-hudi-support-utils:jar:0.2.0-SNAPSHOT:compile
[INFO] | \- io.delta:delta-core_2.12:jar:2.4.0:compile
[INFO] | +- io.delta:delta-storage:jar:2.4.0:compile
[INFO] | \- org.antlr:antlr4-runtime:jar:4.9.3:compile
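
For illustration, exclusions along the following lines would drop two of the artifacts traced in the tree output above from the xtable-core dependency. This is only a sketch built from the coordinates shown in that output, not the actual contents of this PR's pom.xml. It assumes the xtable-core version is managed in the parent pom, and it excludes only the two artifacts that appear explicitly in the trees.

```xml
<dependency>
  <groupId>org.apache.xtable</groupId>
  <artifactId>xtable-core</artifactId>
  <!-- Assumption: version is inherited from dependencyManagement in the parent pom -->
  <exclusions>
    <!-- Reaches utilities through delta-core, per the tree output -->
    <exclusion>
      <groupId>org.antlr</groupId>
      <artifactId>antlr4-runtime</artifactId>
    </exclusion>
    <!-- Reaches utilities through hudi-common and hbase-server, per the tree output -->
    <exclusion>
      <groupId>javax.servlet.jsp</groupId>
      <artifactId>javax.servlet.jsp-api</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

An exclusion declared on a direct dependency applies to every transitive path under it, so a single entry here covers the artifact regardless of which intermediate module pulls it in.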

Author

@vamsikarnika vamsikarnika Sep 19, 2024

The dependencies below are marked yellow in the sheet and are present in the current dependency tree for the utilities module. I will try to remove them.

jersey-json
jersey-core
jersey-server
jersey-servlet
jaxb-impl
javax.activation-api
javax.annotation-api
javax.servlet.jsp-api
javax.servlet-api
javax.ws.rs-api
jsr311-api
jaxb-api
javax.servlet.jsp
javax.el
jol-core

Author

@vamsikarnika vamsikarnika Sep 20, 2024

XTable sync fails when the following non-compliant dependencies are removed:

jersey-server
jol-core

<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.version.prefix}</artifactId>
<scope>runtime</scope>
</dependency>
Author

@vamsikarnika vamsikarnika Sep 17, 2024

Removing these dependencies because mvn dependency:analyze reports these jars as unused declared dependencies in the utilities module.

[WARNING] Unused declared dependencies found:
[WARNING] org.apache.logging.log4j:log4j-core:jar:2.22.0:compile
[WARNING] org.apache.logging.log4j:log4j-slf4j2-impl:jar:2.22.0:compile
[WARNING] org.apache.spark:spark-core_2.12:jar:3.4.2:runtime
[WARNING] org.apache.spark:spark-sql_2.12:jar:3.4.2:runtime
[WARNING] org.apache.parquet:parquet-avro:jar:1.12.2:compile
[WARNING] org.apache.hadoop:hadoop-aws:jar:3.3.6:runtime
[WARNING] com.amazonaws:aws-java-sdk-bundle:jar:1.12.328:runtime
[WARNING] org.apache.hadoop:hadoop-azure:jar:3.3.6:runtime
[WARNING] com.google.cloud.bigdataoss:gcs-connector:jar:hadoop3-2.2.22:runtime
[WARNING] org.junit.jupiter:junit-jupiter-engine:jar:5.9.0:test
[WARNING] org.projectlombok:lombok:jar:1.18.30:provided

<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>gcs-connector</artifactId>
</dependency>
Author

Removing these for the same reason as above.

Contributor

Without these jars, won't users be unable to run the xtable jar sync for s3 or gcs paths?

@vinishjail97
Contributor

> With these changes, the bundled xtable-utilities jar comes to around 160 MB. It was around 600 MB before.

Thanks for the optimizations @vamsikarnika, I added some comments. Can you run the new jar with the demos to confirm nothing breaks? I strongly suspect s3/gcs sync will fail without the s3/gcs connector dependencies.

@vamsikarnika
Author

vamsikarnika commented Sep 19, 2024

> > With these changes, the bundled xtable-utilities jar comes to around 160 MB. It was around 600 MB before.
>
> Thanks for the optimizations @vamsikarnika, I added some comments. Can you run the new jar with the demos to confirm nothing breaks? I strongly suspect s3/gcs sync will fail without the s3/gcs connector dependencies.

Hey @vinishjail97, I'm having trouble running the demos locally on my Mac (an M2 machine). I get a segmentation fault when I try to run the command below.

java -jar xtable-utilities/target/xtable-utilities-0.2.0-SNAPSHOT-bundled.jar --datasetConfig my_config.yaml

The Terminal crash report is below:

-------------------------------------
Translated Report (Full Report Below)
-------------------------------------

Process:               Terminal [51019]
Path:                  /System/Applications/Utilities/Terminal.app/Contents/MacOS/Terminal
Identifier:            com.apple.Terminal
Version:               2.13 (447)
Build Info:            Terminal-447000000000000~1296
Code Type:             ARM-64 (Native)
Parent Process:        launchd [1]
User ID:               501

Date/Time:             2024-09-19 14:04:33.6076 +0530
OS Version:            macOS 13.4.1 (22F770820d)
Report Version:        12
Anonymous UUID:        C6BC4607-2EAC-FD44-043D-E0ECE9D0D67E

Sleep/Wake UUID:       CE8D2B4E-2C85-4BE8-A588-C203561F81AB

Time Awake Since Boot: 65000 seconds
Time Since Wake:       1707 seconds

System Integrity Protection: enabled

Crashed Thread:        0  Dispatch queue: com.apple.main-thread

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_PROTECTION_FAILURE at 0x000000016ebffd00
Exception Codes:       0x0000000000000002, 0x000000016ebffd00

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [51019]

I'm seeing this error during the dynamic attach of the jar:

2024-09-19 14:27:03 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:133 - Loading HoodieTableMetaClient from file:/tmp/hudi-dataset/people/.hoodie/metadata
2024-09-19 14:27:03 INFO  org.apache.hudi.common.table.HoodieTableConfig:276 - Loading table properties from file:/tmp/hudi-dataset/people/.hoodie/metadata/.hoodie/hoodie.properties
2024-09-19 14:27:03 INFO  org.apache.hudi.common.table.HoodieTableMetaClient:152 - Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE) from file:/tmp/hudi-dataset/people/.hoodie/metadata
# WARNING: Unable to get Instrumentation. Dynamic Attach failed. You may add this JAR as -javaagent manually, or supply -Djdk.attach.allowAttachSelf

@vamsikarnika
Author

> > With these changes, the bundled xtable-utilities jar comes to around 160 MB. It was around 600 MB before.
>
> Thanks for the optimizations @vamsikarnika, I added some comments. Can you run the new jar with the demos to confirm nothing breaks? I strongly suspect s3/gcs sync will fail without the s3/gcs connector dependencies.

Yeah, you are right, we need these deps at runtime. mvn dependency:analyze only checks dependencies required at compile time.

I've removed some of the dependencies, such as aws-sdk-bundle, and confirmed that sync still works with s3. However, after these changes the jar size hasn't gone down by much.
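
Since the analysis cannot see jars that are only needed at runtime, one option is to tell the plugin to stop flagging them, so the warning list only contains real candidates for removal. The fragment below is a hedged sketch using the maven-dependency-plugin's `ignoredUnusedDeclaredDependencies` parameter. Which connectors belong in the list is an assumption based on the runtime-scoped entries in the warning output earlier in this thread, and the filter syntax should be checked against the plugin version in use.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-dependency-plugin</artifactId>
  <configuration>
    <!-- Assumption: these connectors are loaded only at runtime, so
         dependency:analyze reports them as unused even though sync needs them -->
    <ignoredUnusedDeclaredDependencies>
      <ignoredUnusedDeclaredDependency>org.apache.hadoop:hadoop-aws</ignoredUnusedDeclaredDependency>
      <ignoredUnusedDeclaredDependency>org.apache.hadoop:hadoop-azure</ignoredUnusedDeclaredDependency>
      <ignoredUnusedDeclaredDependency>com.google.cloud.bigdataoss:gcs-connector</ignoredUnusedDeclaredDependency>
    </ignoredUnusedDeclaredDependencies>
  </configuration>
</plugin>
```

This does not change what ends up in the bundled jar. It only keeps the analysis report from suggesting the removal of dependencies that are needed for s3, Azure, or gcs paths.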
